ORLA 6541 - Exercise 1

The following code and write-up contains answers to in-text exercises in Wickham & Grolemund (2017): Chapters 3-6 and Bulut & Dejardins (2019) Chapter 4.

Wickham & Grolemund (2017):

Chapter 3: Data Visualization

3.2.4 Exercises

1. Run ggplot(data = mpg). What do you see?

A blank grey square

library(tidyverse)

ggplot(data=mpg)

2. How many rows are in mpg? How many columns?

234 rows and 11 columns

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

3. What does the drv variable describe? Read the help for ?mpg to find out.

The drv variable describes the type of drive train for each car in the data set, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd.

?mpg

4. Make a scatterplot of hwy vs cyl.

ggplot(data = mpg)+geom_point(mapping=aes(x=cyl, y=hwy))

5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

Since both variables are categorical, the points align based on the intersection of both categories, which does not provide any useful information. Scatterplots are more useful for mapping the relationship between two continuous variables.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = class, y = drv))

## 3.3.1 Exercises #### 1. What’s gone wrong with this code? Why are the points not blue? ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = “blue”))

aes() is used to map an aesthetic to a variable. In this case, we wanted to associate the aesthetic color and with the actual color “blue”. Since “blue” is not a variable, we cannot map it in aesthetic, and had to set the aesthetic properties of our geom manually (outside of aes() . As such:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

?mpg

Categorical: manufacturer, model, trans, drv, fl, class Continuous: displ, year, cyl, cty, hwy

You can see this information when running ?mpg under each variable’s name

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

Assigning color to a continuous variable provides a numerical scale of colors while assigning color to a categorical variable categorizes all levels of the variable by different colors.

Similarly, assigning size to a continuous variable provides a numerical scale of sizes, while assigning size to a categorical variable (although not advised), categorizes all levels of the variable by different sizes.

Continuous variables can not be mapped to shape. While assigning shape to a categorical variable, all levels of the variable by different shapes (with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate)

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = cty))

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, size = cty))

4. What happens if you map the same variable to multiple aesthetics?

You get a scatterpllot with an overlap of two aesthetics, both convey the same information.

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = cty, alpha=cty))

5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

It is used to modify the width of the border of shapes with borders.

?geom_point

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

It categorizes the variable assigned to the x-axis (in this case, displ) to < or > 5 by two different colors.

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = displ<5))

3.5.1 Exercises

1. What happens if you facet on a continuous variable?

We get a new column for each value of the variable.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ cty, nrow = 2)

2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

The empty cells in facet_grid suggest that the variables did not intersect within these values. This relates to the first plot because we can notice that in the first plot, there is much empty space and if we were to facet it, we are likely to have empty cells as well.

3. What plots does the following code make? What does . do?

. allows us to facet the plot by all levels of the variable provided without needing to specify the levels.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

4. Take the first faceted plot in this section: What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Advantages: it is easier to identify patterns within each value of a variable; we can have a larger number of facets while still being able to identify which facet represents which value, while with a large number of colors, it may be difficult to discriminate between values. Disadvantages: It may be more challenging to compare patterns between values of the variable and to identify overall patterns.

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

?facet_wrap

nrow determines the number of rows in the facet. Other options that control the layout of the individual panels are: ncol, scale, and dir. ncol determines the number of columns in the facet. facet_grid() does not have number of nrow and ncol arguments because they are determined by the number of unique levels the variables have.

6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Because if you put the variable with the more unique levels in the rows, it would create a taller then wider plot, which may be hard to follow and interpret on most computers.

3.6.1 Exercises

1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

line chart - geom_line() boxplot - geom_boxplot() histogram - geom_histogram()

2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

This will create a scatter plot with displ as the x variable and hwy as the y variable. drv will be represented by both lines and errors colored by the levels of the variable, and their standard error would not appear.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

show.legend = FALSE hides the legend box in the of the plot. It plausible that it was removed to ensure the plot’s size is consistent with previous plots.

4. What does the se argument to geom_smooth() do?

It visulizes the confidence intervals of the line as shaded areas; TRUE to show, FALSE to hide.

5. Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

No, they will be identical since Because both geom_point() and geom_smooth() will uses the same data and mapping noted in the ggplot(), which is the same for both plots.

6. Recreate the R code necessary to generate the following graphs.

ggplot(data=mpg, mapping=aes(x = displ, y = hwy,)) +
  geom_point(size=5) +
  geom_smooth(se = FALSE, size=2)

 ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
     geom_smooth(mapping = aes(group = drv), se = FALSE, size=2) +
  geom_point(size=5)

 ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color=drv))+
     geom_smooth(se = FALSE, size=2) +
  geom_point(size=5)

ggplot(data = mpg, mapping = aes (x = displ, y = hwy))+
     geom_smooth(se = FALSE, size=2) +
  geom_point(mapping=aes(color=drv), size=5)

 ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
     geom_smooth(mapping=aes(linetype=drv), se = FALSE, size=2) +
  geom_point(mapping = aes(color=drv), size=5)

 ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
geom_point(size = 5, color = "white") +
  geom_point(aes(color = drv))

3.7.1 Exercises

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

?stat_summary

the default geom associated with stat_summary is geom_pointrange().

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

2. What does geom_col() do? How is it different to geom_bar()?

?geom_col
?geom_bar

There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.

3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

Pairs of geoms and stats share names in common and typically have the same default stat.

4. What variables does stat_smooth() compute? What parameters control its behaviour?

?stat_smooth

Two parameters that control stat_smooth()'s behavior are method and formula. Method specifies the kind of smoothing function to apply, like a linear model (lm) or a generalized linear model (glm), and formula allows the user to provide the variables of interest for the given modeling strategy. In addition, users can also adjust the fullrange parameter, which tells R to either fit the smoothing for just the data or the full plot.

stat_smooth() generates the following variables:

y or x predicted value of the given variable(s)

ymin or xmin lower pointwise confidence interval around the mean

ymax or xmax upper pointwise confidence interval around the mean

se standard error

5.In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

The problems is that all of the bars are the same height, which is inaccurate. Setting group=1 ensures that the proportions for each stack are calculated using each subset.

3.8.1 Exercises

1. What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

It is possible that many points in the plot overlap and, therefore, hide each other. We can improve the plot by using the geom_jitter() function to add a small amount of random noise to each point, which will reveal any overlapping points. As such:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()

2. What parameters to geom_jitter() control the amount of jittering?

Width, controlling the horizontal random noise, and height, controlling the vertical random noise.

?geom_jitter

3. Compare and contrast geom_jitter() with geom_count().

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_count()

While geom_jitter adds random noise to each point to reduce over plotting of points, geom_count changes the size of each point per the number of observation for its specific value. Although geom_count can help to reduce over plotting; however, it may create over plotting by itself due to the size of the points.

4. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

?geom_boxplot

The default position adjustment for geom_boxplot() is dodge2.

ggplot(data = mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

3.9.1 Exercises

1. Turn a stacked bar chart into a pie chart using coord_polar().

ggplot(mpg, aes(x = class, fill = drv)) + geom_bar()

ggplot(mpg, aes(x = class, fill = drv)) + geom_bar() + coord_polar()

2. What does labs() do? Read the documentation.

?labs()

labs() allows users to set plot labels. Within the labs() function, users can provide details for host of plot labels, like title, subtitle, caption, and more.

3. What’s the difference between coord_quickmap() and coord_map()?

?coord_quickmap

coord_map() from ggplot2 allows users to map their data onto a 2D projection of the earth. Due to the curvature of the earth, the documentation notes that map projections do not preserve straigth lines. As such, coord_quickmap() can be used alternatively, which performs calculations to provide an approximation of the requested area of the globe but with straight lines.

4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

The plot below suggests a strong positive relationship between city miles per gallon to highway miles per gallon; as one rises, so does that other.

coord_fixed() is important because it ensures a fixed ratio between the physical representation of data points on the X and Y axes. This makes it easier to identify patterns in the plot.

geom_abline() adds a reference line to a plot, either horizontal, vertical, or diagonal, which is useful for annotating plots. In this case, it added a horizontal line to the plot, which made it easier to identify the relationship between city and highway mpg.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  coord_fixed()

Wickham & Grolemund Chapter 4

Workflow: basics

4.4 Exercises

1. Why does this code not work?

my_variable <- 10 my_varıable #> Error in eval(expr, envir, enclos): object ‘my_varıable’ not found

This code does not work because the “I” in the second row of the code is capital, which does not match the “i” above.

2. Tweak each of the following R commands so that they run correctly:

library(tidyverse)

ggplot(dota = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

fliter(mpg, cyl = 8) filter(diamond, carat > 3)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

filter(mpg, cyl == 8)
## # A tibble: 70 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a6 quattro   4.2  2008     8 auto… 4        16    23 p     mids…
##  2 chevrolet    c1500 sub…   5.3  2008     8 auto… r        14    20 r     suv  
##  3 chevrolet    c1500 sub…   5.3  2008     8 auto… r        11    15 e     suv  
##  4 chevrolet    c1500 sub…   5.3  2008     8 auto… r        14    20 r     suv  
##  5 chevrolet    c1500 sub…   5.7  1999     8 auto… r        13    17 r     suv  
##  6 chevrolet    c1500 sub…   6    2008     8 auto… r        12    17 r     suv  
##  7 chevrolet    corvette     5.7  1999     8 manu… r        16    26 p     2sea…
##  8 chevrolet    corvette     5.7  1999     8 auto… r        15    23 p     2sea…
##  9 chevrolet    corvette     6.2  2008     8 manu… r        16    26 p     2sea…
## 10 chevrolet    corvette     6.2  2008     8 auto… r        15    25 p     2sea…
## # … with 60 more rows
filter(diamonds, carat > 3)
## # A tibble: 32 × 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  3.01 Premium I     I1       62.7    58  8040  9.1   8.97  5.67
##  2  3.11 Fair    J     I1       65.9    57  9823  9.15  9.02  5.98
##  3  3.01 Premium F     I1       62.2    56  9925  9.24  9.13  5.73
##  4  3.05 Premium E     I1       60.9    58 10453  9.26  9.25  5.66
##  5  3.02 Fair    I     I1       65.2    56 10577  9.11  9.02  5.91
##  6  3.01 Fair    H     I1       56.1    62 10761  9.54  9.38  5.31
##  7  3.65 Fair    H     I1       67.1    53 11668  9.53  9.48  6.38
##  8  3.24 Premium H     I1       62.1    58 12300  9.44  9.4   5.85
##  9  3.22 Ideal   I     I1       62.6    55 12545  9.49  9.42  5.92
## 10  3.5  Ideal   H     I1       62.8    57 12587  9.65  9.59  6.03
## # … with 22 more rows

3. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

A menu with all shortcuts appear. I can access this menu through Tools, keyboard shortcuts helps.

library(tidyverse) 

Bulut & Dejardins (2019) Chapter 4

Wrangling big data with data.table

This chapter demonstrates the application of the data.table in R as an alternative to tidyverse and base R methods of data wrangling. The authors suggest that data.table is particularly useful for working with large volumes of data (e.g. 10 - 100 GB)

library(data.table) 

4.2.1 Exercises

4.3.1 Exercises

1. Subset all the Female students (ST004D01T) in Germany.

2. How many female students are there in Germany?

3. The .N function returns the length of a vector/number of rows. Use chaining with the .N function to answer Exercise 2.

Using data.table syntax to subset the data set, we found that there are 3,197 female students from Germany in the data set. In addition, an alternative method using the .N function to return the number of rows is shown below as well.

# explore data; first 5 and last 5 rows returned when printing a data.table object 
#random6 # don't run for now due to loading speed 

# subset all female students from the field ST004D01T 

#random6[CNT == "Germany" & ST004D01T == "Female"]

# return row count with .N function and chaining 
#random6[CNT == "Germany" & ST004D01T == "Female", .N] # don't print code as the output is very large in the markdown report 

4.4.1 Exercises

1. The computer and software variables that were created above ask a student whether they had a computer in their home that they can use for school work (computer) and whether they had educational software in their home (software). Find the proportion of students in the Germany and Uruguay that have a computer in their home or have educational software.

# first create custom function to convert bin to numeric  

bin.to.num <- function(x){
  if (is.na(x)) NA
  else if (x == "Yes") 1L
  else if (x == "No") 0L
}


# using the custom function, create new variables, computer and software, that we can use to calculate the proportion of students with access to these resources  


random6[, `:=` 
     (female = ifelse(ST004D01T == "Female", 1, 0),
       sex = ST004D01T,
       
       # At my house we have ...
       desk = sapply(ST011Q01TA, bin.to.num),
       own.room = sapply(ST011Q02TA, bin.to.num),
       quiet.study = sapply(ST011Q03TA, bin.to.num),
       computer = sapply(ST011Q04TA, bin.to.num),
       software = sapply(ST011Q05TA, bin.to.num),
       internet = sapply(ST011Q06TA, bin.to.num),
       lit = sapply(ST011Q07TA, bin.to.num),
       poetry = sapply(ST011Q08TA, bin.to.num),
       art = sapply(ST011Q09TA, bin.to.num),
       book.sch = sapply(ST011Q10TA, bin.to.num),
       tech.book = sapply(ST011Q11TA, bin.to.num),
       dict = sapply(ST011Q12TA, bin.to.num),
       art.book = sapply(ST011Q16NA, bin.to.num))]

In Germany, 95% of students in Germany have a computer in their home and 45% of students have educational software.

random6[CNTRYID == "Germany", table(computer)] # 247 No's and 5438 Yes' 
## computer
##    0    1 
##  247 5438
# computer prop 
5438 /(5438 + 247)*100 #  95% of students in data set in Germany have a computer in
## [1] 95.65523
random6[CNTRYID == "Germany", table(software)] # 3038 No's and 2481 Yes' 
## software
##    0    1 
## 3038 2481
# educational software prop 
2481/(2481 + 3038) # 45% have access to educational software 
## [1] 0.449538

In Uruguay, 88% of students have a computer in the home and 43% of students have educational software.

random6[CNTRYID == "Uruguay", table(computer)] # 635 No's and 5099 Yes' 
## computer
##    0    1 
##  635 5099
# computer in home prop 
5099/(5099 + 635) # 88% have access to educational software 
## [1] 0.8892571
random6[CNTRYID == "Uruguay", table(software)] # 3013 No's and 2313 Yes' 
## software
##    0    1 
## 3013 2313
# educational software prop 
2313/(2313 + 3013) # 43% have access to educational software 
## [1] 0.4342846

2. For just female students, find the proportion of students who have their own room (own.room) or a quiet place to study (quiet.study).

Among female students in the random6 data set, 72% stated that they have their own room and 85% stated that they have a quiet place to study.

random6[female == 1, table(own.room)] # 4805 No's and 12644 Yes' 
## own.room
##     0     1 
##  4805 12644
# prop own.room

12644 / (12644 + 4805) #72% 
## [1] 0.7246261
random6[female == 1, table(quiet.study)] #2606 No's and 14801 Yes' 
## quiet.study
##     0     1 
##  2606 14801
# prop quiet.study 

14801 / (14801 + 2606) # 85% 
## [1] 0.8502901

4.5.1 Exercises

1. Calculate the proportion of students who have art in their home (art) and the average age (AGE) of the students by gender.

51% of students have art in their home. The average age of male and female students are 15.8 years.

# proportion of students who have art in their homes: 
random6[, .(art = mean(art, na.rm = TRUE), AGE = mean(AGE, na.rm = TRUE)), by = .(sex)] 
##       sex       art      AGE
## 1: Female 0.5061029 15.81118
## 2:   Male 0.4981741 15.80671

2. Within a by argument you can discretize a variable to create a grouping variable. Perform a median split for age within the by argument and assess whether there are age difference associated with having your own room (own.room) or a desk (desk).

# create custom function to split a column of data by above or below median 
median_cut = function(x) { 
  ifelse(x > median(x), "above_median", "below_median")
}


random6[,
     .(own.room = mean(own.room, na.rm = TRUE), 
       desk_mean = mean(desk, na.rm = TRUE)), 
     by = .(median_cut(AGE)) ] 
##      median_cut  own.room desk_mean
## 1: above_median 0.7439972 0.8684374
## 2: below_median 0.7441766 0.8680203

The authors show that the code below, using the melt() function, can be used to reshape the data sets from wide to long formats.

4.8 Lab

1. This afternoon when we discuss supervised learning, we’ll ask you to develop some models to predict the response to the question Do you expect your child will go into a ?” (PA032Q03TA). Recode this variable so that a “Yes” is 1 and a “No” is a -1 and save the variable as sci_car.

# recode and store as 'sci_car'; use table to confirm re-coding
random6[, 
        "sci_car" := sapply(PA032Q03TA, 
                            function(x) { 
                              if (is.na(x)) NA
                              else if (x == "No") -1L 
                              else if (x == "Yes") 1L 
                              
                              }) ][,
                                    table(sci_car)]
## sci_car
##   -1    1 
## 5104 4946

2. Calculate descriptives for this variable by sex and country. Specifically, the proportion of test takers whose parents said “Yes” or 1.

Because we are using the more limited random6 data set, we specify to return Germany and Mexico specifically, since the other countries are not in the data set.

random6[CNTRYID %in% c("Germany", "Mexico"), 
        .(mean(sci_car, na.rm = TRUE)), 
        by = .(sex, CNTRYID)]
##       sex CNTRYID         V1
## 1: Female Germany -0.8348018
## 2:   Male Germany -0.7085292
## 3:   Male  Mexico  0.3975610
## 4: Female  Mexico  0.3200577

3. Means and standard deviations (sd) for the variables that you think will be most predictive of sci_car.

# means and sd of variables hypothesized to predict sci_car
random6[, 
        .(environ_avg =  mean(ENVAWARE, na.rm = TRUE), 
          environ_sd = sd(ENVAWARE, na.rm = TRUE), 
          joy_avg = mean(JOYSCIE, na.rm = TRUE), 
          joy_sd = sd(JOYSCIE, na.rm = TRUE), 
          self_eff_avg = mean(SCIEEFF, na.rm = TRUE), 
          self_eff_sd = sd(SCIEEFF, na.rm = TRUE)
          )]
##    environ_avg environ_sd    joy_avg   joy_sd self_eff_avg self_eff_sd
## 1:  -0.1688647   1.097327 0.06985556 1.117806 -0.008148365    1.204854

4. Calculate these same descriptives by groups (by sci_car and by sex).

# means and sd of focal variables, grouped by sci_car and sex 
random6[, 
        .(environ_avg =  mean(ENVAWARE, na.rm = TRUE), 
          environ_sd = sd(ENVAWARE, na.rm = TRUE), 
          joy_avg = mean(JOYSCIE, na.rm = TRUE), 
          joy_sd = sd(JOYSCIE, na.rm = TRUE), 
          self_eff_avg = mean(SCIEEFF, na.rm = TRUE), 
          self_eff_sd = sd(SCIEEFF, na.rm = TRUE)
          ), by = .(sci_car, sex)]
##    sci_car    sex environ_avg environ_sd     joy_avg    joy_sd self_eff_avg
## 1:      NA Female -0.31842610  0.9874307 -0.09216258 1.1173348 -0.171922907
## 2:      -1 Female -0.01864386  0.9728241 -0.11241710 1.0385336  0.016697673
## 3:       1   Male  0.18221720  1.1622236  0.55847968 1.0094256  0.361605273
## 4:      -1   Male  0.06017463  1.1645145  0.12967621 1.0792426  0.162395292
## 5:      NA   Male -0.20814244  1.1916179  0.06879025 1.1356331 -0.009655823
## 6:       1 Female  0.08936607  0.9659264  0.55664182 0.9397745  0.311395444
##    self_eff_sd
## 1:    1.186161
## 2:    1.088183
## 3:    1.107026
## 4:    1.141781
## 5:    1.267313
## 6:    1.059443

5. Calculate correlations between these variables and sci_car

# first store smaller data set containing observations for all focal variables and sci_car 
sci_car_variables <- random6[, 
        .(sci_car, ENVAWARE, JOYSCIE, SCIEEFF)]

sci_car_variables[, 
                   .(env_joy_cor = cor(x = ENVAWARE, y = JOYSCIE, use = "complete.obs"),
                   joy_self_eff_cor = cor(x = JOYSCIE, y = SCIEEFF, use = "complete.obs"), 
                   env_sci_car_cor = cor(x = ENVAWARE, y = sci_car, use = "complete.obs"), 
                   sci_car_joy_cor = cor(x = JOYSCIE, y = sci_car, use = "complete.obs"))]
##    env_joy_cor joy_self_eff_cor env_sci_car_cor sci_car_joy_cor
## 1:   0.3650092        0.3482787      0.05626268       0.2666819

6. Create new variables: Discretize the math and reading variables using the OECD means (490 for math and 493) and code them as 1 (at or above the mean) and -1 (below the mean), but do in the data.table way without using the $ operator.

# recode and store as 'sci_car'; use table to confirm re-coding
random6[, 
        "math_split" := sapply(PV1MATH, 
                            function(x) { 
                              if (is.na(x)) NA
                              else if (x <= 490) -1L 
                              else if (x > 490) 1L 
                              
                              })][,
                                    table(math_split)]
## math_split
##    -1     1 
## 21578 14269
random6[, 
               "reading_split" := sapply(PV1READ, 
                            function(x) { 
                              if (is.na(x)) NA
                              else if (x <= 493) -1L 
                              else if (x > 493) 1L}) ][, 
                                                     table(reading_split)]
## reading_split
##    -1     1 
## 21191 14656
# calculate correlations with these new variables and those above 
# store as table for subsequent formatting 
math_reading_cor <- random6[, 
                   .(env_joy_cor = cor(x = ENVAWARE, y = JOYSCIE, use = "complete.obs"),
                   joy_self_eff_cor = cor(x = JOYSCIE, y = SCIEEFF, use = "complete.obs"), 
                   env_sci_car_cor = cor(x = ENVAWARE, y = sci_car, use = "complete.obs"), 
                   sci_car_joy_cor = cor(x = JOYSCIE, y = sci_car, use = "complete.obs"), 
                   math_joy_cor = cor(x = math_split, y = sci_car, use = "complete.obs"), 
                   reading_joy_cor = cor(x = reading_split, y = sci_car, use = "complete.obs"), 
                   math_env_cor = cor(x = ENVAWARE, y = math_split, use = "complete.obs"), 
                   reading_env_cor = cor(x = JOYSCIE, y = reading_split, use = "complete.obs"), 
                   math_self_eff_cor = cor(x = SCIEEFF, y = math_split, use = "complete.obs"), 
                  reading_self_eff_cor = cor(x = SCIEEFF, y = reading_split, use = "complete.obs"))] 

# pivot long using melt(); easier to view correlations this way 
melt(math_reading_cor) 
##                 variable       value
##  1:          env_joy_cor  0.36500924
##  2:     joy_self_eff_cor  0.34827868
##  3:      env_sci_car_cor  0.05626268
##  4:      sci_car_joy_cor  0.26668187
##  5:         math_joy_cor -0.21986258
##  6:      reading_joy_cor -0.20404487
##  7:         math_env_cor  0.12821305
##  8:      reading_env_cor  0.05982742
##  9:    math_self_eff_cor  0.02318556
## 10: reading_self_eff_cor  0.04272108

7. Chain together a set of operations: For example, create an intermediate variable that is the average of JOYSCIE and INTBRSCI, and then calculate the mean by country by sci_car through chaining.

random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, .SDcols = c("JOYSCIE", "INTBRSCI")][,,by = sci_car]
## Warning in `[.data.table`(random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, :
## Ignoring by= because j= is not supplied
## Warning in `[.data.table`(random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, :
## i and j are both missing so ignoring the other arguments. This warning will be
## upgraded to error in future.
##        CNTRYID     Mean
##     1: Germany -0.68550
##     2: Germany -1.62215
##     3: Germany -0.97025
##     4: Germany  1.03440
##     5: Germany  0.06700
##    ---                 
## 35843: Uruguay  1.16175
## 35844: Uruguay  0.22755
## 35845: Uruguay -1.29245
## 35846: Uruguay -0.31095
## 35847: Uruguay -0.54705

8. Transform variables, specifically recode MISCED and FISCED from characters to numeric variables.

# create column subset to convert to numeric
cols_to_convert <- c("MISCED", "FISCED") 

# convert to numeric with subset index and using lapply() 
random6[ , 
           (cols_to_convert) := lapply(.SD, as.numeric),
           .SDcols = c("MISCED", "FISCED")]
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion

## Warning in lapply(.SD, as.numeric): NAs introduced by coercion

9. Examine other variables in the pisa data set that you think might be predictive of PA032Q03TA.

library(psych) 
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
random6[, lapply(.SD, mean), .SDcols = c("DISCLISCI", "TEACHSUP", "IBTEACH", "TDTEACH")]
##    DISCLISCI TEACHSUP IBTEACH TDTEACH
## 1:        NA       NA      NA      NA
# run descriptive statistics for three additional variables hypothesized to relate to PA032Q03TA 
descriptive_stats <- random6[ ,lapply(.SD, psych::describe), 
         .SDcols = c("DISCLISCI", "TEACHSUP","TDTEACH")] 

# reshape table into long format so that it is easier to read the output
melt(descriptive_stats) 
## Warning in melt.data.table(descriptive_stats): id.vars and measure.vars are
## internally guessed when both are 'NULL'. All non-numeric/integer/logical type
## columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
##               variable         value
##  1:     DISCLISCI.vars  1.000000e+00
##  2:        DISCLISCI.n  3.140600e+04
##  3:     DISCLISCI.mean  1.445539e-01
##  4:       DISCLISCI.sd  1.002982e+00
##  5:   DISCLISCI.median  3.900000e-03
##  6:  DISCLISCI.trimmed  1.504836e-01
##  7:      DISCLISCI.mad  1.032186e+00
##  8:      DISCLISCI.min -2.416200e+00
##  9:      DISCLISCI.max  1.883700e+00
## 10:    DISCLISCI.range  4.299900e+00
## 11:     DISCLISCI.skew -1.627706e-01
## 12: DISCLISCI.kurtosis -2.219641e-01
## 13:       DISCLISCI.se  5.659616e-03
## 14:      TEACHSUP.vars  1.000000e+00
## 15:         TEACHSUP.n  3.062000e+04
## 16:      TEACHSUP.mean  1.259763e-01
## 17:        TEACHSUP.sd  9.746136e-01
## 18:    TEACHSUP.median  4.540000e-02
## 19:   TEACHSUP.trimmed  1.846121e-01
## 20:       TEACHSUP.mad  9.899320e-01
## 21:       TEACHSUP.min -2.719500e+00
## 22:       TEACHSUP.max  1.447500e+00
## 23:     TEACHSUP.range  4.167000e+00
## 24:      TEACHSUP.skew -3.721514e-01
## 25:  TEACHSUP.kurtosis -1.845543e-01
## 26:        TEACHSUP.se  5.569675e-03
## 27:       TDTEACH.vars  1.000000e+00
## 28:          TDTEACH.n  3.012000e+04
## 29:       TDTEACH.mean -2.597529e-02
## 30:         TDTEACH.sd  9.603662e-01
## 31:     TDTEACH.median -1.170000e-02
## 32:    TDTEACH.trimmed -3.303284e-02
## 33:        TDTEACH.mad  8.873361e-01
## 34:        TDTEACH.min -2.447600e+00
## 35:        TDTEACH.max  2.078100e+00
## 36:      TDTEACH.range  4.525700e+00
## 37:       TDTEACH.skew -7.223483e-03
## 38:   TDTEACH.kurtosis  3.790284e-01
## 39:         TDTEACH.se  5.533621e-03
##               variable         value